Computer and Modernization ›› 2010, Vol. 1 ›› Issue (6): 187-190. doi: 10.3969/j.issn.1006-2475.2010.06.053

• Applications and Development •

Realization and Evaluation of Paodingjieniu Chinese Segmentation in Nutch

SUN Dian-zhe1, WEI Hai-ping2, CHEN Yan1   

  1. Graduate School, Liaoning Shihua University, Fushun 113001, China; 2. School of Computer and Communication Engineering, Liaoning Shihua University, Fushun 113001, China
  • Received: 2010-02-22 Revised: 1900-01-01 Online: 2010-07-01 Published: 2010-07-01

Abstract: Chinese word segmentation is one of the main challenges for search engines. Based on an analysis of Nutch's document scoring mechanism, and because the output of Nutch's built-in Chinese word segmentation module does not conform to Chinese language usage, this paper proposes using the dictionary-based Paodingjieniu Chinese word segmentation module to segment the data collected by Nutch. It describes how to integrate the Paodingjieniu module into Nutch and then tests the module. Experiments show that the segmentation results of the Paodingjieniu module conform more closely to Chinese language usage and that term coverage across documents is more balanced; in addition, the index files require 20% to 65% less storage space.

Key words: Chinese word segmentation, scoring mechanism, Paodingjieniu
